Gene mutation split | Regex string

Girish Venkataraman

Join Date: Dec 2021

Posts: 293
#1

17 Dec 2022, 09:01

Hello all,

I have a column of co-occurring mutations, one string per observation, and need to split it into multiple columns so that each new column holds one mutation. I managed to split off the first mutation and the remainder (second through last), but I am stuck on going further by picking out the first instance of a ")".
The string is too long for dataex to export, but a sample of the column 'Cooccurmutn' is shown below. The pattern I find useful is that each gene name starts with an upper-case letter and immediately follows either a lower-case letter or a ")" that ends the previous mutation.

DNMT3A c.2645G>A, p.R882H (NM_022552.4) (VAF: 38%)IDH2 c.419G>A, p.R140Q (NM_002168.3) (VAF: 37%)JAK2 c.1849G>T, p.V617F (NM_004972.3) (VAF: 71%)
CUX1 Loss - EquivocalWT1 Loss - EquivocalAPC Loss EquivocalBRCA1 Loss - EquivocalEZH2 Loss - EquivocalNF1 Loss
SF3B1 c.2098A>G, p.K700E (NM_012433.4) (VAF: 12%)WT1 c.1156_1159dup, p.A387Vfs*4 (NM_024426.6) (VAF: 5%)ATR c.2634-1G>A, p.? (NM_001184.4) (VAF: 15%)STK11 Loss - Equivocal

gen com = ustrregexs(0) if ustrregexm(Cooccurmutn, "(^[A-Z].*[a-z]$)") // first comutation
replace com = ustrregexs(0) if ustrregexm(Cooccurmutn, "^[A-Z].*([a-z]|\))$") & missing(com) // first comutation
replace com = ustrregexs(1) if ustrregexm(Cooccurmutn, "(^[A-Z]*.*[a-z]{4})([A-Z]*.*)") & missing(com) // first comutation
replace com = ustrregexs(1) if ustrregexm(Cooccurmutn, "(^[A-Z]*.*([0-9]|%)\))([A-Z]*.*)") & missing(com) // first

gen com2 =ustrregexra(Cooccurmutn, "(^[A-Z]*.*([0-9]|%)\))","") // second through end

What I want for the first row/obs would be:
comut1 comut2 comut3
DNMT3A.* | IDH2.* |JAK2.*
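
The boundary rule described above (an upper-case letter immediately after a lower-case letter or a ")") can be sketched as follows. This is only a sketch under assumptions: it relies on the regex engine behind ustrregexra() accepting lookbehind and lookahead, it assumes the character "|" never occurs in the data, and the names marked and comut are illustrative. A mutation description that itself contains a lower-case letter directly followed by an upper-case one (for example a "delinsAT" notation) would be split wrongly, so the result needs checking against the real data.

```stata
clear
input str300 Cooccurmutn
"DNMT3A c.2645G>A, p.R882H (NM_022552.4) (VAF: 38%)IDH2 c.419G>A, p.R140Q (NM_002168.3) (VAF: 37%)JAK2 c.1849G>T, p.V617F (NM_004972.3) (VAF: 71%)"
"CUX1 Loss - EquivocalWT1 Loss - EquivocalAPC Loss EquivocalBRCA1 Loss - EquivocalEZH2 Loss - EquivocalNF1 Loss"
"SF3B1 c.2098A>G, p.K700E (NM_012433.4) (VAF: 12%)WT1 c.1156_1159dup, p.A387Vfs*4 (NM_024426.6) (VAF: 5%)ATR c.2634-1G>A, p.? (NM_001184.4) (VAF: 15%)STK11 Loss - Equivocal"
end

* Put a "|" at every position where a lower-case letter or ")" is
* immediately followed by an upper-case letter (zero-width match, so
* no characters are removed). Assumes "|" is not already in the data.
generate marked = ustrregexra(Cooccurmutn, "(?<=[a-z)])(?=[A-Z])", "|")

* Split on the inserted delimiter: one new variable per mutation chunk
split marked, parse("|") generate(comut)
drop marked

list comut*, noobs
```

Marking the boundary first and splitting afterwards keeps the ")" and the trailing "Equivocal" inside each chunk, which a split directly on ")" would not do.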

Alternatively, I have a file with a master list of all human genes (hugolist.dta), and I was hoping to look up each row of my data against the gene column in that file and extract the matches, but I don't know whether that is easier in R or in Stata. I am rusty with text functions in R, though.

William Lisowski

Join Date: Dec 2014
Posts: 10150

17 Dec 2022, 10:52

I don't really understand what you seek. What does DNMT3A.* | IDH2.* |JAK2.* mean in the context of what you seek? Can you not just copy and paste EXACTLY the values you want in your comut variables to help us understand what you are looking for?

Perhaps this will start you in a useful direction - the first step is to split each string on each right parenthesis.

Code:

input str300 text
"DNMT3A c.2645G>A, p.R882H (NM_022552.4) (VAF: 38%)IDH2 c.419G>A, p.R140Q (NM_002168.3) (VAF: 37%)JAK2 c.1849G>T, p.V617F (NM_004972.3) (VAF: 71%)"
"CUX1 Loss - EquivocalWT1 Loss - EquivocalAPC Loss EquivocalBRCA1 Loss - EquivocalEZH2 Loss - EquivocalNF1 Loss"
"SF3B1 c.2098A>G, p.K700E (NM_012433.4) (VAF: 12%)WT1 c.1156_1159dup, p.A387Vfs*4 (NM_024426.6) (VAF: 5%)ATR c.2634-1G>A, p.? (NM_001184.4) (VAF: 15%)STK11 Loss - Equivocal"
end
generate seq = _n
split text, generate(maybe) parse(")")
local maybes `r(varlist)'
foreach v of local maybes {
    replace `v' = ustrregexrf(`v'," .*","")
}
reshape long maybe, i(seq) j(which)
drop if maybe==""
bysort seq (which): replace which = _n
reshape wide maybe, i(seq) j(which)
list seq maybe*, noobs clean

Code:

. list seq maybe*, noobs clean

    seq   maybe1   maybe2   maybe3   maybe4  
      1   DNMT3A     IDH2     JAK2           
      2     CUX1                             
      3    SF3B1      WT1      ATR    STK11

Girish Venkataraman

Join Date: Dec 2021

Posts: 293
#3

17 Dec 2022, 11:12

Sorry for the confusion, William Lisowski, and thanks very much for that code. What I really want in the end is the following (only the first co-mutation of the first case is shown):

Comut1    Comut1type                          Comut1vaf
DNMT3A    c.2645G>A, p.R882H (NM_022552.4)    38

If I could just split each observation into these major chunks, one per gene, I should be able to split the smaller bits more easily. It was the second observation, where a lower-case letter rather than a ")" comes right before the next gene, that threw me off when trying to split the gene chunks. Parsing on ")" seems to miss the later entries of that second observation, which does not contain any ")".
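
Once each chunk sits in its own variable, the three pieces shown above could be pulled out along these lines. This is a sketch only: the input variable chunk1 and its three example values are made up for illustration from the sample strings, and it assumes that the gene name is everything before the first space and that the VAF, when present, always appears as "(VAF: nn%)" at the end of the chunk.

```stata
clear
input str80 chunk1
"DNMT3A c.2645G>A, p.R882H (NM_022552.4) (VAF: 38%)"
"CUX1 Loss - Equivocal"
"WT1 c.1156_1159dup, p.A387Vfs*4 (NM_024426.6) (VAF: 5%)"
end

* Gene name: the run of non-space characters at the start of the chunk
generate Comut1 = ustrregexs(1) if ustrregexm(chunk1, "^(\S+)")

* VAF as a number; stays missing when the chunk has no "(VAF: nn%)" part
generate Comut1vaf = real(ustrregexs(1)) if ustrregexm(chunk1, "\(VAF: *([0-9.]+)%\)")

* Mutation type: drop the leading gene name, then the trailing VAF block
generate Comut1type = ustrregexrf(chunk1, "^\S+ *", "")
replace Comut1type = ustrtrim(ustrregexrf(Comut1type, " *\(VAF:[^)]*\)$", ""))

list Comut1 Comut1type Comut1vaf, noobs
```

Chunks such as "CUX1 Loss - Equivocal" carry no VAF, so Comut1vaf is left missing for them rather than set to zero.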

Last edited by Girish Venkataraman; 17 Dec 2022, 11:25. Reason: Accidental post before completion